feat(sinker): Ship 8 — Yuv420p RGBA output via const-ALPHA template#16
feat(sinker): Ship 8 — Yuv420p RGBA output via const-ALPHA template#16
Conversation
There was a problem hiding this comment.
Pull request overview
Adds end-to-end 8-bit RGBA output for the Yuv420p pipeline by extending MixedSinker with an RGBA buffer path and introducing a const-generic <const ALPHA: bool> kernel template that’s implemented across scalar + all SIMD backends.
Changes:
- Add
MixedSinker<Yuv420p>::with_rgba/set_rgba, newMixedSinkerErrorvariants, and aprocess()RGBA kernel path. - Add public
row::yuv_420_to_rgba_rowdispatcher plus scalar + SIMD backend implementations using the shared const-generic template. - Add format-level and per-backend scalar-equivalence tests for RGBA output.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/sinker/mixed.rs |
Adds RGBA buffer plumbing for MixedSinker<Yuv420p>, new error variants, and RGBA-specific tests. |
src/row/mod.rs |
Exposes yuv_420_to_rgba_row dispatcher and adds rgba_row_bytes() helper. |
src/row/scalar.rs |
Introduces yuv_420_to_rgba_row and shared scalar <const ALPHA: bool> template. |
src/row/arch/x86_common.rs |
Adds write_rgba_16 helper for x86 shuffle-based RGBA stores. |
src/row/arch/x86_sse41.rs |
Implements SSE4.1 RGBA row kernel and equivalence tests using write_rgba_16. |
src/row/arch/x86_avx2.rs |
Implements AVX2 RGBA row kernel + write_rgba_32 and equivalence tests. |
src/row/arch/x86_avx512.rs |
Implements AVX-512 RGBA row kernel + write_rgba_64 and equivalence tests. |
src/row/arch/neon.rs |
Implements NEON RGBA row kernel via vst4q_u8 and equivalence tests. |
src/row/arch/wasm_simd128.rs |
Implements wasm simd128 RGBA row kernel + write_rgba_16 and equivalence tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// `u16` RGBA buffer attached via [`MixedSinker::with_rgba_u16`] / | ||
| /// [`MixedSinker::set_rgba_u16`] is shorter than `width × height × 4` | ||
| /// `u16` elements. Only high‑bit‑depth source impls write into this | ||
| /// buffer; the fourth `u16` per pixel is alpha — opaque | ||
| /// (`(1 << BITS) - 1`) by default when the source has no alpha | ||
| /// plane. |
There was a problem hiding this comment.
The intra-doc links in this variant’s docs point to MixedSinker::with_rgba_u16 / set_rgba_u16, but those methods don’t exist anywhere in the crate yet. This will trigger rustdoc::broken_intra_doc_links warnings (and may fail CI if docs are built with warnings-as-errors). Consider removing the links for now (leave them as code-formatted text), or introducing the corresponding APIs in the appropriate format-specific impl blocks before linking to them.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Tranche 2 of Ship 8 sink-side RGBA. Mass-applies the const-generic-ALPHA template proven by PR #16 to the **4:2:0 semi-planar family** — NV12 (UV-ordered, FFmpeg's `nv12`, the most common HW-decoder output across CUDA / NVDEC / VideoToolbox / VAAPI / Android MediaCodec / QSV) and NV21 (VU-ordered, Android camera default).
#18) Tranche 3 of Ship 8 sink-side RGBA. **Wiring-only PR** — no new SIMD code. The 4:2:2 formats reuse the 4:2:0 kernels from tranches 1 + 2: `Yuv422p` calls `yuv_420_to_rgba_row` (already shipped in PR #16), and `Nv16` calls `nv12_to_rgba_row` (already shipped in PR #17). 4:2:2's per-row contract is identical to 4:2:0's — half-width chroma, one pair per Y pair — so the same kernel handles both with no changes.
First PR of Ship 8 (sink-side
with_rgba/with_rgba_u16for every existing source format). Lands the template-validating slice: end-to-end RGBA support forYuv420ponly — public API, all five SIMD backends, MixedSinker integration, per-backend equivalence tests. The remaining ~30 kernel families (NV12, NV21, Yuv422p, Yuv444p, all bit-depth variants, P010 / P016 / P210 / P410, etc.) inherit the same const-generic-ALPHA template via mechanical mass-apply in follow-up PRs — see What's deferred below.Scope
Sink-side only (the high-value half per
docs/color-conversion-functions.md§ Ship 8). YUVA source frames + per-pixel alpha pass-through are deferred to a follow-up PR. Default alpha is0xFF(opaque) for sources that don't carry an alpha plane — every YUV format shipped today.Design: const-generic
ALPHA: boolkernel templateThe natural alternatives:
<const ALPHA: bool>; the only delta is the final per-block store (vst3q_u8↔vst4q_u8on NEON; newwrite_rgba_*shuffle helpers on x86 / wasm parallel to the existingwrite_rgb_*). Same speed as native, ~1/5th the LOC. Compiler DCE eliminates theif ALPHAbranch and the unused alpha-vector splat at each call site.All five backends keep R / G / B in separate vectors at store time, so adding alpha is purely a 4th vector + different shuffle/store — no re-interleave.
What's in this PR
Public API (
colconv::sinker)MixedSinker<Yuv420p>::with_rgba(&mut [u8])andset_rgba— attaches a packed 32-bit RGBA buffer (width × height × 4bytes). Format-specific impl block (not on the genericMixedSinker<F>), so attaching RGBA to a sink whoseprocessdoesn't write it is a compile error rather than a silent stale-buffer bug. Future formats add their own impl blocks as RGBA support lands; acompile_faildoctest pins this property.MixedSinker::produces_rgba()/produces_rgba_u16()— generic queries (alwaysfalsefor sinks without a write path; symmetrical with the existingproduces_rgb_u16()).MixedSinkerError::RgbaBufferTooShort { expected, actual }andRgbaU16BufferTooShort— new error variants.row::yuv_420_to_rgba_row(...)— public dispatcher mirroringyuv_420_to_rgb_row(...). Same contract;rgba_out.len() >= 4 * width.Per-backend kernels (
src/row/)scalar.rsyuv_420_to_rgba_row+ sharedyuv_420_to_rgb_or_rgba_row::<const ALPHA: bool>templatearch/x86_common.rswrite_rgba_16(4 shuffle masks × 4 stores per 16 pixels = 64 bytes)arch/x86_sse41.rsyuv_420_to_rgba_row+ shared template; useswrite_rgba_16arch/x86_avx2.rsyuv_420_to_rgba_row+write_rgba_32(two halves throughwrite_rgba_16)arch/x86_avx512.rsyuv_420_to_rgba_row+write_rgba_64(four quarters throughwrite_rgba_16)arch/neon.rsyuv_420_to_rgba_row+ shared template; uses nativevst4q_u8arch/wasm_simd128.rsyuv_420_to_rgba_row+ newwrite_rgba_16(4 swizzle masks × 4 stores)mod.rsrgba_row_bytes(width)helper + publicyuv_420_to_rgba_rowdispatcherThe 4-byte stride aligns cleanly with the 16-byte register width on every backend, so the RGBA shuffle masks are simpler than the 3-byte RGB pattern (no channel "split across blocks" at the 16-byte boundary). NEON's
vst4q_u8is a native 4-channel store — zero shuffle overhead vs. the 3-channelvst3q_u8.MixedSinker integration
MixedSinker<Yuv420p>::processruns the RGBA path as an independent kernel call (not compose). The const-generic-ALPHA monomorphization meansyuv_420_to_rgba_row::<true>andyuv_420_to_rgb_row::<false>are separate functions sharing one source body — both fully inlined at their respective call sites with the unused branch eliminated.SIMD coverage
yuv_420_to_rgba_rowBlock sizes per iteration mirror the existing RGB kernels: NEON / SSE4.1 / wasm = 16 px; AVX2 = 32 px; AVX-512 = 64 px.
Tests
Format-level (5 tests in
src/sinker/mixed.rs):rgba_only_converts_gray_to_gray_with_opaque_alpha— gray YUV → gray RGB + alpha =0xFF.rgba_alpha_is_opaque_for_arbitrary_color— alpha is0xFFfor non-gray content.with_rgb_and_with_rgba_produce_byte_identical_rgb_bytes— cross-format invariant: alpha is the only difference betweenwith_rgbandwith_rgbaoutputs.rgba_with_simd_false_matches_with_simd_true— SIMD ≡ scalar parity across 8 widths covering all backend block sizes (16 / 32 / 64) plus tails.rgba_buffer_too_short_returns_err— short buffer rejected atwith_rgbacall.yuv_420_to_rgba_simd_matches_scalar_with_random_yuv— pseudo-random per-pixel YUV (width 1922, all 4 matrices × both ranges) catches lane-order corruption that solid-color tests would miss.Per-backend equivalence (18 tests in
src/row/arch/*.rs):unsafe yuv_420_to_rgba_rowdirectly under runtime feature detection, comparing againstscalar::yuv_420_to_rgba_rowbyte-for-byte. Bypasses the dispatcher so every backend gets exercised on every CI runner regardless of which tier the dispatcher would pick.(byte, pixel, channel R/G/B/A, width, matrix, full_range, scalar_value, simd_value)— first divergence is locally diagnosable.Compile-fail doctest in
MixedSinker<Yuv420p>::with_rgbaproves attaching RGBA to a non-Yuv420p sink (e.g.MixedSinker::<Nv12>::new(...).with_rgba(...)) is a compile error.Local results (aarch64 macOS): 429 lib tests + 1 doctest pass.
CI matrix coverage (already wired in
.github/workflows/):test-sde-avx512(Intel SDE Ice Lake) — AVX-512 path.testjobs on AMD EPYC ubuntu-latest / macOS aarch64 — AVX2 / NEON.coverage.ymlper-tier matrix with--cfg colconv_disable_avx512/--cfg colconv_disable_avx2/--cfg colconv_force_scalar— exercises SSE4.1 + scalar fallback paths.crossjob —wasm32-wasip1test suite exercises the wasm RGBA tests.What's deferred
Out of scope for this PR (follow-up work):
with_rgba_u16— native-depthu16RGBA output for high-bit-depth sources. Error variant + format-specific impl pattern is already in place; just needs wiring per format.Yuv420pAFrameetc.) — when a source carries an alpha plane, the kernel copies it through to the RGBA buffer instead of writing the opaque default. Doc roadmap calls this the "source-side half" of Ship 8.yuv_420_to_rgbavs.yuv_420_to_rgb— confirm the const-ALPHA path doesn't regress the RGB baseline and measure the RGBA throughput delta. Mechanically straightforward; deferred to keep this PR focused.Test plan
test,test-sde-avx512,cross,coverage,clippy,buildjobs.MixedSinker::<Nv12>::new(...).with_rgba(...).cargo doc --lib --no-depsclean (no new doc warnings vs. main).🤖 Generated with Claude Code